Abstract

Indexed languages are interesting in computational linguistics because they
form the smallest class of languages in the Chomsky hierarchy that has not been
shown to be inadequate for describing the string set of natural language
sentences. Here we define a class of unification grammars that exactly describes
the class of indexed languages.

1 Introduction 

The occurrence of purely syntactical cross-serial dependencies in Swiss-German
shows that context-free grammars cannot describe the string sets of natural
language [Shi85]. The least class in the Chomsky hierarchy that can describe unlimited
cross-serial dependencies is the class of indexed languages, generated by indexed
grammars [Aho68]. Gazdar discusses in [Gaz88] the applicability of indexed grammars
to natural languages, and shows how they can be used to describe different syntactic
structures. Here we study how the class of indexed languages can be described with a
unification grammar formalism. After defining indexed grammars and a simple
unification grammar framework, we show how to define an equivalent unification
grammar for any given indexed grammar. Two grammars are equivalent if they generate
the same language. With this background we define a class of unification grammars
and show that this class describes the class of indexed languages.

2 Indexed grammars 

Indexed grammars are a grammar formalism with generative capacity between
context-free grammars and context-sensitive grammars. Context-free grammars
cannot describe cross-serial dependencies, as follows from the pumping lemma, while
indexed grammars can. However, the class of languages generated by indexed grammars,
the indexed languages, is a proper subset of the context-sensitive languages [Aho68].

An indexed grammar can be seen as a context-free grammar where we add a string,
or stack, of indices to the nonterminal nodes in the phrase structure trees, or
derivation trees as we will call them. Some production rules add an index to the
beginning of the string, while the use of other production rules depends on
the first index in the string. When such a production rule is applied, the index
on which it depends is removed, and the rest of the index string is kept by
the daughter(s). In this way we may distribute information from one part of the
derivation tree to another. The original definition of indexed grammars was given
by Aho [Aho68]. Here we use the definition given by Hopcroft and Ullman
[HU79], with some minor notational variations:

Definition 1 An INDEXED GRAMMAR G is a 5-tuple G = (N, T, I, P, S) where

N is a finite set of symbols, called nonterminals,

T is a finite set of symbols, called terminals,

I is a finite set of symbols, called indices,

P is a finite set of ordered pairs, each of one of the forms (A, Bf), (Af, α) or
(A, α), where A and B are nonterminal symbols in N, α is a finite string in
(N ∪ T)*, and f is an index in I. An element of P is called a production rule
and is written A → Bf, Af → α or A → α,

S is a symbol in N, called the start symbol,

and such that N, T and I are pairwise disjoint.

An indexed grammar G = (N, T, I, P, S) is on REDUCED FORM if each production
in P is of one of the forms

a) A → Bf

b) Af → B

c) A → BC

d) A → t

where A, B, C are in N, f is in I, and t is in (T ∪ {ε}).

Aho showed in his original paper [Aho68] that for every indexed grammar there
exists an indexed grammar on reduced form which generates the same language.

To define constituent structures and derivation trees we are going to use tree
domains: Let N_+ be the set of all integers greater than zero. A tree domain D is
a set D ⊆ N_+* of number strings such that if x ∈ D then all prefixes of x are also
in D, and for all i ∈ N_+ and x ∈ N_+*, if xi ∈ D then xj ∈ D for all j, 1 ≤ j < i.
The out degree d(x) of an element x in a tree domain D is the cardinality of the set
{i | xi ∈ D, i ∈ N_+}. The set of terminals of D is term(D) = {x | x ∈ D, d(x) = 0}.

The elements of a tree domain are totally ordered lexicographically as follows: x ⪯ y
if x is a prefix of y, or there exist strings z, z', z'' ∈ N_+* and i, j ∈ N_+ with
i < j, such that x = ziz' and y = zjz''. We also define that x ≺ y if x ⪯ y and
x ≠ y.¹
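The two closure conditions and the ordering can be checked mechanically. The following Python sketch is only an illustration under our own assumptions: a node is represented as a tuple of positive integers (with the empty tuple playing the role of ε), and the function names are hypothetical, not part of the formalism.

```python
from typing import Iterable, List, Set, Tuple

Node = Tuple[int, ...]  # a string of positive integers; () stands for ε


def is_tree_domain(nodes: Iterable[Node]) -> bool:
    """Check prefix closure and left-sibling closure of a set of nodes."""
    d: Set[Node] = set(nodes)
    for x in d:
        if any(i < 1 for i in x):
            return False
        if x:
            parent, i = x[:-1], x[-1]
            # Checking the parent of every node gives closure under all prefixes.
            if parent not in d:
                return False
            # Checking the sibling i - 1 of every node gives xj in D for all j < i.
            if i > 1 and parent + (i - 1,) not in d:
                return False
    return True


def out_degree(d: Set[Node], x: Node) -> int:
    """d(x): the number of children of x in D."""
    return sum(1 for y in d if len(y) == len(x) + 1 and y[:-1] == x)


def terminals(d: Set[Node]) -> List[Node]:
    """term(D), listed in the lexicographic order defined in the text."""
    # Python compares tuples lexicographically, and a proper prefix sorts first.
    return sorted(x for x in d if out_degree(d, x) == 0)
```

Sorting tuples is used here because Python's built-in tuple comparison coincides with the ordering defined above: a prefix precedes its extensions, and otherwise the first differing position decides.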

A tree domain D can be viewed as a tree graph in the following way: The
elements of D are the nodes in the tree, ε is the root, and for every x ∈ D the
element xi ∈ D is x's child number i. A tree domain may be infinite, but we shall
restrict attention to finite tree domains. A finite tree domain can also describe the
topology of a derivation tree. This representation provides a name for every node
in the derivation tree directly from the definition of a tree domain. Our definition
of derivation trees for indexed grammars with the use of tree domains is based on
Hayashi [Hay73]:

Definition 2 A DERIVATION TREE based on an indexed grammar G = (N, T, I, P, S)
is a pair (D, C_I) of a finite tree domain D and a function C_I : D → (NI* ∪ T ∪ {ε})
where

i) C_I(ε) = S

ii) C_I(x) ∈ NI* for every node x in D with d(x) > 0. Moreover if C_I(x) = Aγ
for A ∈ N and γ ∈ I*, and C_I(xi) = B_i δ_i with B_i ∈ (N ∪ T ∪ {ε}) and δ_i ∈ I*
for every i : 1 ≤ i ≤ d(x), then either

    a) A → B_1 f is a production rule in P such that d(x) = 1, f ∈ I, and
       δ_1 = fγ, or

    b) Af → B_1 ... B_d(x) is a production rule in P such that f ∈ I where
       γ = fγ', and δ_i = γ' if B_i ∈ N and δ_i = ε if B_i ∈ (T ∪ {ε}), or

    c) A → B_1 ... B_d(x) is a production rule in P such that δ_i = γ if B_i ∈ N
       and δ_i = ε if B_i ∈ (T ∪ {ε}).

iii) C_I(x) ∈ (T ∪ {ε}) for every node in D with d(x) = 0.

The symbol function C_I^sym : D → (N ∪ T ∪ {ε}), and the index string function
C_I^idx : D → I*, are total functions on D such that if C_I(x) = Aγ where
A ∈ (N ∪ T ∪ {ε}) and γ ∈ I* then C_I^sym(x) = A and C_I^idx(x) = γ for all x ∈ D.

The TERMINAL STRING of a derivation tree (D, C_I) is the string C_I(x_1)...C_I(x_n)
where {x_1, ..., x_n} = term(D) and x_i ≺ x_{i+1} for all i, 1 ≤ i ≤ n − 1.

We also define the LICENSE FUNCTION license : (D − term(D)) → P, such that
if A → α is a production rule according to a), b) or c) in ii) for a node x in D,
then license(x) = A → α.

¹ See Gallier [Gal86] for more about tree domains.

Informally this is a traditional phrase structure tree. If we have a node with
label Aγ where A is a nonterminal symbol and γ is a string of indices, and we use
a production rule A → Bf, then the node's only child gets the label Bfγ. If we
instead use a production rule A → BC on the same node it gets two children labeled
Bγ and Cγ respectively, or if we use a production rule A → t where t is a terminal
symbol, then we remove all the indices and the node's only child gets the label t.
If we have a node labeled with Afγ, where f is an index, and we use a production
rule Af → B, then the node's only child gets the label Bγ. We also see that the
terminal string is a string in T* since C_I(x) ∈ (T ∪ {ε}) for all x ∈ term(D).
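The four cases of this informal description can be written down as a small function. This is a minimal sketch, assuming a grammar on reduced form; the tagged-tuple encoding of rules ("push", "pop", "branch", "term") and the name `daughters` are our own illustrative choices, not notation from the definitions above.

```python
from typing import List, Tuple

Stack = Tuple[str, ...]      # index string, first index at position 0
Label = Tuple[str, Stack]    # (symbol, index string)


def daughters(symbol: str, stack: Stack, rule: tuple) -> List[Label]:
    """Labels of the children of a node labeled `symbol stack` under `rule`."""
    kind, lhs = rule[0], rule[1]
    if lhs != symbol:
        raise ValueError("rule does not apply to this nonterminal")
    if kind == "push":                    # A -> Bf: child gets B f stack
        _, _, b, f = rule
        return [(b, (f,) + stack)]
    if kind == "pop":                     # Af -> B: first index must be f
        _, _, f, b = rule
        if not stack or stack[0] != f:
            raise ValueError("rule needs index %r first in the index string" % f)
        return [(b, stack[1:])]
    if kind == "branch":                  # A -> BC: both children copy the stack
        _, _, b, c = rule
        return [(b, stack), (c, stack)]
    if kind == "term":                    # A -> t: all indices are dropped
        _, _, t = rule
        return [(t, ())]
    raise ValueError("unknown rule kind")
```

For instance, `daughters("A", ("g",), ("push", "A", "B", "f"))` yields the single child label `("B", ("f", "g"))`, matching the label Bfγ in the text with γ = g.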

Definition 3 A string w is GRAMMATICAL with respect to an indexed grammar G 
if and only if there exists a derivation tree based on G with w as the terminal string. 
The language generated by G, L(G) is the set of all grammatical strings with respect 
to G. 

Example 1 Let G = (N, T, I, P, S) be an indexed grammar where T = {a, b, c} is
the set of terminal symbols, N = {S, S', A, B, C} is the set of nonterminal symbols,
I = {f, g} is the set of indices and P is the least set containing the following
production rules:

    S  → S'f        Ag → aA        Af → a

    S' → S'g        Bg → bB        Bf → b

    S' → ABC        Cg → cC        Cf → c

Figure 1 shows the derivation tree for the string "aabbcc" based on this grammar.
The language L(G) generated by this grammar is {a^n b^n c^n | n ≥ 1}.
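To see how the index string carries the count from one branch to the others, the derivations of this grammar can be simulated directly. The sketch below is an illustration only: it hard-codes the rules of Example 1 with f as the bottom index and g as the counting index, and the function names `expand` and `derive` are our own.

```python
NONTERMINALS = {"S", "S'", "A", "B", "C"}

# Rules that consume the first index: (nonterminal, index) -> right-hand side.
POP_RULES = {
    ("A", "g"): ["a", "A"], ("A", "f"): ["a"],
    ("B", "g"): ["b", "B"], ("B", "f"): ["b"],
    ("C", "g"): ["c", "C"], ("C", "f"): ["c"],
}


def expand(symbol, stack):
    """Terminal string derived from `symbol` carrying index string `stack`."""
    if symbol not in NONTERMINALS:
        return symbol                       # terminals carry no indices
    rhs = POP_RULES[(symbol, stack[0])]     # the rule depends on the first index
    rest = stack[1:]                        # the used index is removed
    return "".join(expand(s, rest) for s in rhs)


def derive(pushes):
    """String obtained by applying S' -> S'g exactly `pushes` times."""
    stack = ("g",) * pushes + ("f",)        # S -> S'f, then the pushes of g
    # S' -> ABC hands the same index string to all three daughters.
    return "".join(expand(x, stack) for x in ("A", "B", "C"))
```

With one application of S' → S'g, `derive(1)` gives "aabbcc", the string shown in Figure 1; each further push adds one a, one b and one c.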

We close this presentation of indexed grammars with a simple technical
observation that we will use in later proofs.

Definition 4 An indexed grammar G = (N, T, I, P, S) has a MARKED INDEX-END
if and only if it has one and only one production rule where the start symbol occurs
and this rule is of the form S → A$ where A ∈ N and the index $ does not occur
in any other production rule.

Figure 1: Derivation tree for the string "aabbcc" based on the grammar in Example 1



If an indexed grammar has a marked index-end then in any derivation tree every
nonterminal node except the root gets a $ at the end of the index list. Since no
rule requires that there is an empty index list, and neither $ nor the start symbol
occurs in any other production rule, it is straightforward to construct an equivalent
grammar with a marked index-end for any indexed grammar.

Lemma 1 For every indexed grammar G there exists an indexed grammar with a 
marked index-end G$ such that L(G) = L(G$). 

Proof: Let G = (N, T, I, P, S) be an indexed grammar, and assume that S_0 and $
do not occur in G. G$ is defined from G by adding the production rule S_0 → S$
such that S_0 becomes the new start symbol and is added to the set of nonterminal
symbols, and $ is added to the set of indices. Formally, if G = (N, T, I, P, S) and
S_0, $ ∉ (N ∪ T ∪ I), then G$ = (N ∪ {S_0}, T, I ∪ {$}, P ∪ {(S_0, S$)}, S_0). Then G$
has a marked index-end, and we have to show that for any string w, w ∈ L(G) if
and only if w ∈ L(G$).

(⇒) Let (D, C_I) be any derivation tree based on G and assume that w is its
terminal string. From this we construct a derivation tree (D', C'_I) based on G$
as follows: First let D' = {1x | x ∈ D} ∪ {ε}. Then let C'_I(ε) = S_0 and let
C'_I(1x) = C_I(x)$ for all x ∈ (D − term(D)). Let also C'_I(1x) = C_I(x) for all
x ∈ term(D). The derivation tree (D', C'_I) then has the same terminal string as
(D, C_I). Since no rule requires that there is an empty index list, and $ does not
occur in any production rule in G, a production rule that licenses a node x in
(D, C_I) will license the node 1x in (D', C'_I). The rule S_0 → S$ licenses the root.
Then (D', C'_I) is a valid derivation tree according to Definition 2.

(⇐) Let (D', C'_I) be any derivation tree based on G$ and assume that w is its
terminal string. Since S_0 → S$ must license the root and $ does not occur in any
other production rule, the index symbol $ occurs at the end of the index list at every
nonterminal node except the root in (D', C'_I). From this derivation tree we construct
a derivation tree (D, C_I) based on G as follows: First let D = {x | 1x ∈ D'}.
Then for all x ∈ (D − term(D)) let C_I(x) = β where C'_I(1x) = β$. Let also
C_I(x) = C'_I(1x) for all x ∈ term(D). The derivation tree (D, C_I) then has the
same terminal string as (D', C'_I). Since every production rule in G$ except S_0 → S$
is also a production rule in G, the rule S_0 → S$ can only license the root, and $
does not occur in any other production rule, a production rule that licenses a node
1x in (D', C'_I) will license the node x in (D, C_I). Then (D, C_I) is a valid derivation
tree according to Definition 2. □
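The construction of G$ in the proof amounts to adding one rule and two fresh symbols. As a hedged sketch, assuming a grammar is stored as a 5-tuple of frozensets with rules encoded as pairs of symbol tuples (an encoding of our own, as are the names `add_marked_index_end`, `new_start` and `end`):

```python
from typing import FrozenSet, Tuple

Rule = Tuple[Tuple[str, ...], Tuple[str, ...]]   # (left-hand side, right-hand side)
Grammar = Tuple[FrozenSet[str], FrozenSet[str], FrozenSet[str], FrozenSet[Rule], str]


def add_marked_index_end(g: Grammar, new_start: str = "S0", end: str = "$") -> Grammar:
    """Return the grammar with the extra rule new_start -> S end, as in Lemma 1."""
    n, t, i, p, s = g
    used = n | t | i
    if new_start in used or end in used:
        raise ValueError("new_start and end must not occur in the grammar")
    start_rule: Rule = ((new_start,), (s, end))   # the pair (S0, S$)
    return (n | {new_start}, t, i | {end}, p | {start_rule}, new_start)
```

All original rules are kept unchanged, so the result stays on reduced form whenever the input is on reduced form, since the added rule has the shape A → Bf.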

Notice in the proof that if G is on reduced form then G$ is also on reduced
form. Hence for any indexed grammar on reduced form there also exists an equivalent
indexed grammar on reduced form with a marked index-end.

3 Unification grammars



We are here going to give a description of a very simple unification grammar for- 
malism. The formalism itself is not particularly interesting, and it is only meant as 
a framework for the rest of this paper. The formalism is just a notational variant of 
the basic formalism used by Colban in his work on restrictions on unification gram- 
mars [Col91]. It should be easy to reformulate this in most of the known formalisms 
available. We give an informal description of feature structures in the way they are 
used here before we define the grammar formalism. 

A feature structure over a set of attribute symbols A and value symbols V is
a four-tuple (Q, δ, α, m_D) where Q is a finite set of nodes, δ : Q × A → Q is a
partial function, called the transition function, α : Q → V is a partial function
called the atomic value function, and m_D : D → Q is a function, called the name
mapping. We will mostly omit the name-domain from the notation, so m will alone
denote the name mapping. We extend the transition function to be a function from
pairs of nodes and strings of attribute symbols: For every q ∈ Q let δ(q, ε) = q.
If δ(q_1, ψ) = q_2 and δ(q_2, a) = q_3 then let δ(q_1, ψa) = q_3 for every
q_1, q_2, q_3 ∈ Q, ψ ∈ A* and a ∈ A.

A feature structure is describable if for every node there is a path from a named
node to that node. This means that for every q ∈ Q there is an x ∈ D and a ψ ∈ A*
such that δ(m(x), ψ) = q. A feature structure is atomic if every node with an atomic
value has no out-edges. This means that for every node q ∈ Q, δ(q, a) is not defined
for any a ∈ A if α(q) is defined. A feature structure is acyclic if it does not contain
attribute cycles. This means that for every node q ∈ Q, δ(q, ψ) = q if and only if
ψ = ε. A feature structure is well defined if it is describable, atomic and acyclic.
Unless stated otherwise, we require feature structures to be well defined in the
rest of this paper.
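The three conditions are easy to test on a finite structure. The following Python sketch assumes, purely for illustration, that δ and α are stored as dictionaries and the name mapping as a dictionary from names to nodes; the function names are hypothetical.

```python
def is_describable(nodes, delta, names):
    """Every node is reachable from some named node via attribute edges."""
    reached = set(names.values())
    frontier = list(reached)
    while frontier:
        q = frontier.pop()
        for (p, _attr), r in delta.items():
            if p == q and r not in reached:
                reached.add(r)
                frontier.append(r)
    return set(nodes) <= reached


def is_atomic(delta, alpha):
    """No node with an atomic value has an outgoing attribute edge."""
    return all(p not in alpha for (p, _attr) in delta)


def is_acyclic(nodes, delta):
    """No non-empty attribute path leads from a node back to itself."""
    successors = {q: [] for q in nodes}
    for (p, _attr), r in delta.items():
        successors[p].append(r)
    state = {}

    def visit(q):
        if state.get(q) == "open":       # q is on the current path: a cycle
            return False
        if state.get(q) == "closed":     # q was fully explored earlier
            return True
        state[q] = "open"
        ok = all(visit(r) for r in successors[q])
        state[q] = "closed"
        return ok

    return all(visit(q) for q in nodes)


def is_well_defined(nodes, delta, alpha, names):
    return (is_describable(nodes, delta, names)
            and is_atomic(delta, alpha)
            and is_acyclic(nodes, delta))
```

The sketch assumes that every node mentioned in `delta` also belongs to `nodes`.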

We use equations to describe feature structures: a given feature structure
either satisfies an equation or does not. A feature structure satisfies the equation

    x_1 ψ_1 = x_2 ψ_2                                                   (1)

if and only if δ(m(x_1), ψ_1) = δ(m(x_2), ψ_2), and the equation

    x_1 ψ_1 = v                                                         (2)

if and only if α(δ(m(x_1), ψ_1)) = v, where x_1, x_2 ∈ D, ψ_1, ψ_2 ∈ A* and v ∈ V.
We only allow equations of those two forms. This means that there is no typing,
quantification, implication, negation, or explicit disjunction as we may find in other
unification grammars and feature logics.

If E is a set of equations of the above form and M is a well defined feature
structure such that M satisfies every equation in E then we say that M satisfies E
and we write

    M ⊨ E                                                               (3)

A set of equations E is consistent if there exists a well defined feature structure
that satisfies E.

The notation of the grammar formalism is borrowed from Lexical Functional 
Grammar [KB82]. 

Definition 5 A SIMPLE UNIFICATION GRAMMAR G over a set of attribute symbols
A and value symbols V is a 5-tuple (N, T, P, L, S) where

N is a finite set of symbols, called nonterminals,

T is a finite set of symbols, called terminals,

P is a finite set of production rules

    A_0 → A_1 ... A_n                                                   (4)
          E_1     E_n

where n ≥ 1, A_0, ..., A_n ∈ N, and for all i, 1 ≤ i ≤ n, E_i is a finite set of
equations of the forms

    ↑ψ = ↓ψ'                                                            (5)
    ↕ψ'' = v                                                            (6)

where ψ, ψ' ∈ A*, ψ'' ∈ A+ and v ∈ V,²

L is a finite set of lexicon rules

    A → t                                                               (7)
        E

where A ∈ N, t ∈ (T ∪ {ε}), and E is a finite set of equations of the form

    ↕ψ'' = v                                                            (8)

where ψ'' ∈ A+ and v ∈ V,

S is a symbol in N, called the start symbol.

As an example (9) is a production rule. 

A ^ B C C (9) 

t=4 t=4 «i 1=1 «3 a 4 

•f c = i>i t a 'i a 3 =4 ai t a 3 = ^2 

Definition 6 A CONSTITUENT STRUCTURE (c-structure) based on a simple
unification grammar G = (N, T, P, L, S) is a triple (D, C_U, E_U) where

D is a finite tree domain,

C_U : D → (N ∪ T ∪ {ε}) is a function,

E_U : (D − {ε}) → Γ is a function where Γ is the set of all equation sets in P
and L,

such that C_U(x) ∈ (T ∪ {ε}) for all x ∈ term(D), C_U(ε) = S, and for all
x ∈ (D − term(D)), if d(x) = n then

    C_U(x) → C_U(x1) ... C_U(xn)                                        (10)
             E_U(x1)     E_U(xn)

is a production or lexicon rule in G.

The TERMINAL STRING of a constituent structure is the string C_U(x_1)...C_U(x_n)
where {x_1, ..., x_n} = term(D) and x_i ≺ x_{i+1} for all i, 1 ≤ i < n.

To get equations that can be satisfied by a feature structure we must instantiate
the up and down arrows in the equations from the rule set. We substitute them
with nodes from the c-structure such that the nodes become the domain of the name
mapping. For this purpose we define the '-function such that
E'_U(xi) = E_U(xi)[x/↑, xi/↓]. We see that the value of the function E'_U is a set
of equations that feature structures may satisfy.



² ↕ denotes here a ↑ or a ↓.

Definition 7 The c-structure (D, C_U, E_U) GENERATES the feature structure M if and
only if

    M ⊨ ⋃ { E'_U(x) | x ∈ (D − {ε}) }                                   (11)

A c-structure may generate different feature structures. The tree domain will 
form a name set for feature structures that this union generates. A string is gram- 
matical if this union is consistent. 

Definition 8 A string w is GRAMMATICAL with respect to a simple unification 
grammar G if and only if there exists a c-structure based on G with w as the terminal 
string and the c-structure generates a well defined feature structure. The language 
generated by G, L(G) is the set of all grammatical strings with respect to G. 

4 From Indexed Grammars to Unification Grammars

We are here going to define a simple unification grammar that is equivalent to
a given indexed grammar. The main idea is that we use feature structures to
represent the index string more or less like a (nested) stack. Feature structures
are also used to represent stacks for indexed grammars by Gazdar and
Mellish [GM89], although they do not go into much detail. Here we define a function
that transforms any indexed grammar on reduced form with a marked index-end
into a simple unification grammar, such that the new grammar generates the same
language.

Definition 9 Let G$ = (N, T, I, P, S) be an indexed grammar on reduced form
with a marked index-end. We then define the simple unification grammar U(G$) as
(N, T, P', L', S) where P' and L' are the least sets where

a) For each rule of the form A → Bf in P, P' has a production rule of the
form

    A → B                                                               (12)
        ↓ next = ↑
        ↓ idx = f

b) For each rule of the form Af → B in P, P' has a production rule of the
form

    A → B                                                               (13)
        ↑ next = ↓
        ↑ idx = f

c) For each rule of the form A → BC in P, P' has a production rule of the
form

    A → B        C                                                      (14)
        ↑ = ↓    ↑ = ↓

d) For each rule of the form A → t in P, L' has a lexicon rule of the form

    A → t                                                               (15)
        ∅

If p is a production rule in G$ then U(p) is the production or lexicon rule in
U(G$) defined by a), b), c) or d).

Notice that there is a one-to-one relation between the production rules in G$, 
and production/lexicon-rules in U(G$). We will later define a class of unification 
grammars which can be defined by production and lexicon rules on the forms used 
here. But first we will show that G$ and U(G$) are equivalent. 
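The one-to-one correspondence of Definition 9 can be sketched as a function from rules to rules. The encoding below is an assumption made for illustration: indexed rules are tagged tuples ("push", "pop", "branch", "term"), equations are plain strings, and `to_unification_rule` is a hypothetical name for the rule-level mapping.

```python
UP, DOWN = "↑", "↓"


def to_unification_rule(rule):
    """Map one reduced-form indexed rule to (kind, mother, [(daughter, equations)])."""
    kind = rule[0]
    if kind == "push":                       # A -> Bf, case a)
        _, a, b, f = rule
        return ("production", a, [(b, {f"{DOWN} next = {UP}", f"{DOWN} idx = {f}"})])
    if kind == "pop":                        # Af -> B, case b)
        _, a, f, b = rule
        return ("production", a, [(b, {f"{UP} next = {DOWN}", f"{UP} idx = {f}"})])
    if kind == "branch":                     # A -> BC, case c)
        _, a, b, c = rule
        return ("production", a, [(b, {f"{UP} = {DOWN}"}), (c, {f"{UP} = {DOWN}"})])
    if kind == "term":                       # A -> t, case d): empty equation set
        _, a, t = rule
        return ("lexicon", a, [(t, set())])
    raise ValueError("rule is not on reduced form")
```

A push rule makes the daughter's structure one level deeper than the mother's, a pop rule does the reverse, and a branching rule identifies the structures of mother and daughters, which is how the index string is passed along.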

Lemma 2 For every indexed grammar G$ on reduced form with a marked index-end,
L(G$) = L(U(G$)).

Proof: We have to show that for any string w, w ∈ L(G$) if and only if
w ∈ L(U(G$)).

(⇒) For every w ∈ L(G$) there exists a derivation tree (D, C_I) for w based
on G$. We have to show that based on U(G$) there exists a c-structure with w as
the terminal string which generates a well defined feature structure. We define the
c-structure (D, C_U, E_U) on the same tree domain D.

For every nonterminal node x in D we have a unique production rule license(x)
in the indexed grammar, and for each production rule in the indexed grammar
we have a unique corresponding production or lexicon rule U(license(x)) in U(G$)
according to Definition 9. If

    U(license(x)) = A_0 → A_1 ... A_n                                   (16)
                          E_1     E_n

then let C_U(xi) = A_i and E_U(xi) = E_i for all 1 ≤ i ≤ n, and let C_U(x) = A_0.
Then we have a valid c-structure and since C_U(x) = C_I^sym(x) for all x ∈ D, it also
has w as terminal string. Now we only have to show that all the equations in the
c-structure are satisfied by a well defined feature structure.

For any finite string γ over an alphabet I we may define a feature structure where
the node set consists of all suffixes of γ together with all symbols occurring in γ.
Here we make a distinction between the singleton string of a symbol, and the symbol
itself, such that they are regarded as two distinct nodes. For all non-empty string
nodes, let the idx attribute point to the first symbol of the string and let the next
attribute point to the rest of the string when we remove the first symbol, i.e.
δ(fγ', idx) = f and δ(fγ', next) = γ' for every non-empty suffix fγ' of γ where
f ∈ I. Let also the atomic value of each symbol-node be the symbol itself, i.e.
α(f) = f. Otherwise, let no more attributes or atomic values be defined, and in
particular let δ(ε, next), δ(ε, idx) and α(ε) be undefined. We extend the definition
directly to any finite set of strings over an alphabet. With any name-mapping to the
string nodes defined from this finite set, this is a well defined feature structure
since each nonempty string has a unique first symbol, and a unique suffix with length
one less than the string itself.

Let M be the feature structure defined as described on the set of all index strings
that occur in the derivation tree (D, C_I), with the mapping of each nonterminal
node in the tree domain to the index-string of that node: m(x) = C_I^idx(x). This is
a well defined feature structure. We now have to show that all the equations in the
c-structure are satisfied by the feature structure M. We have three different cases
to consider:

Assume for a node x that C_I(x) = Aγ where γ is an index-string and that
license(x) = A → Bf. Then C_I(x1) = Bfγ, m(x) = γ and m(x1) = fγ. From
U(license(x)) we have that E'_U(x1) = {x1 next = x, x1 idx = f}, which is satisfied
by the feature structure M since δ(fγ, next) = γ, and α(δ(fγ, idx)) = f.

Assume for a node x that C_I(x) = Afγ where fγ is a nonempty index-string
and that license(x) = Af → B. Then C_I(x1) = Bγ, m(x) = fγ and
m(x1) = γ. From U(license(x)) we have that E'_U(x1) = {x next = x1, x idx = f},
which is satisfied by the index-string feature structure M since δ(fγ, next) = γ, and
α(δ(fγ, idx)) = f.

Assume for a node x that C_I(x) = Aγ where γ is an index-string and that
license(x) = A → BC. Then C_I(x1) = Bγ, C_I(x2) = Cγ and m(x) = m(x1) =
m(x2) = γ. From U(license(x)) we have that E'_U(x1) = {x = x1} and E'_U(x2) =
{x = x2}, which are satisfied by the index-string feature structure M.

We do not have to consider the nodes licensed by production rules with terminal
symbols, since all the terminal nodes have empty equation sets. Hence all
the equations in the c-structure are satisfied by the feature structure M, and so
w ∈ L(U(G$)).

(⇐) We will here use the function idx-lst : Q → V* defined on any well
defined acyclic feature structure as follows: idx-lst(q) = α(q) if α(q) is defined.
If δ(q, idx) and δ(q, next) are both defined then idx-lst(q) is the concatenation of
idx-lst(δ(q, idx)) followed by idx-lst(δ(q, next)). Otherwise idx-lst(q) = ε. We
restrict our attention to its prefix with $ as last symbol: Let idx-lst$ : Q → V* be
the function such that idx-lst$(q) is the smallest prefix of idx-lst(q) with $ as the
last symbol. If idx-lst(q) does not contain any $ then idx-lst$(q) = ε.
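The two functions translate directly into a recursive reading of the feature structure. The sketch below assumes δ and α are dictionaries and that the structure is acyclic, so the recursion terminates; the names `idx_lst` and `idx_lst_end` are ours and stand for idx-lst and idx-lst$.

```python
def idx_lst(q, delta, alpha):
    """The string of atomic values read off from node q, as a tuple."""
    if q in alpha:
        return (alpha[q],)
    if (q, "idx") in delta and (q, "next") in delta:
        head = idx_lst(delta[(q, "idx")], delta, alpha)
        tail = idx_lst(delta[(q, "next")], delta, alpha)
        return head + tail
    return ()                                # neither case applies: empty string


def idx_lst_end(q, delta, alpha, end="$"):
    """Smallest prefix of idx_lst(q) ending in `end`, or () if `end` is absent."""
    full = idx_lst(q, delta, alpha)
    if end not in full:
        return ()
    return full[: full.index(end) + 1]
```

Cutting at the first $ is what lets the proof ignore whatever the feature structure contains below the marked end of the index string.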

For every w ∈ L(U(G$)) there exists a c-structure (D, C_U, E_U) for w based on
U(G$) which generates a well defined feature structure. We define the derivation
tree (D, C_I) for w based on G$ on the same tree domain D. Let C_I^sym(x) = C_U(x)
for all nodes in D and C_I^idx(x) = idx-lst$(m(x)) for all nonterminal nodes in D
except for the root ε, for which we define C_I^idx(ε) to be the empty string. This
derivation tree has w as terminal string, and we just have to show that this is a
valid derivation tree according to Definition 2.

Since G$ has a marked index-end, the only production rule where the start
symbol occurs is S → A$, for an A ∈ N. This gives the following corresponding
production rule in U(G$):

    S → A                                                               (17)
        ↓ next = ↑
        ↓ idx = $

which is the only production rule in U(G$) where the start symbol occurs. Then
C_I(ε) = S which is the start symbol of G$. Here we also have that idx-lst$(m(1)) =
$ and C_U(1) = A so that C_I(1) = A$ and S → A$ licenses the root node. For all
the other nonterminal nodes in the tree domain we have four cases to consider:

Assume for a nonterminal node x except for the root node that C_U(x) = A and
idx-lst$(m(x)) = γ. Then C_I(x) = Aγ. Assume also that there exists a production
rule in U(G$) from Definition 9 a), such that C_U(x1) = B, E'_U(x1) = {x1 next =
x, x1 idx = f} and x1 has no sister nodes. Since $ only occurs in the one production
rule with the start symbol, f ≠ $. Then idx-lst$(m(x1)) = fγ and C_I(x1) = Bfγ.
By the reverse of Definition 9 a), there exists a production rule A → Bf in G$,
which licenses x.

Assume for a nonterminal node x except for the root node that C_U(x) = A
and idx-lst$(m(x)) = fγ. Then C_I(x) = Afγ. Assume also that there exists a
production rule in U(G$) from Definition 9 b), such that C_U(x1) = B, E'_U(x1) =
{x next = x1, x idx = f} and x1 has no sister nodes. Since $ only occurs in the
one production rule with the start symbol, f ≠ $. Then idx-lst$(m(x1)) = γ and
C_I(x1) = Bγ. By the reverse of Definition 9 b), there exists a production rule
Af → B in G$, which licenses x.

Assume for a nonterminal node x except for the root node that C_U(x) = A
and idx-lst$(m(x)) = γ. Then C_I(x) = Aγ. Assume also that there exists a
production rule in U(G$) from Definition 9 c), such that d(x) = 2, C_U(x1) = B,
C_U(x2) = C, E'_U(x1) = {x = x1} and E'_U(x2) = {x = x2}. Then idx-lst$(m(x1)) =
idx-lst$(m(x2)) = γ, C_I(x1) = Bγ and C_I(x2) = Cγ. By the reverse of Definition 9
c), there exists a production rule A → BC in G$, which licenses x.

Assume for a nonterminal node x except for the root node that C_U(x) = A and
idx-lst$(m(x)) = γ. Then C_I(x) = Aγ. Assume also that there exists a lexicon rule
in U(G$) from Definition 9 d), such that d(x) = 1, C_U(x1) = t and E'_U(x1) = ∅.
Then C_I(x1) = t. By the reverse of Definition 9 d), there exists a production rule
A → t in G$ which licenses x.

We then have a valid derivation tree with the same terminal string as the
c-structure and then w ∈ L(G$). □

Example 2 Let G = (N, T, I, P, S) be an indexed grammar where T = {d} is the
set of terminal symbols, N = {S, A, B, C, C', D} is the set of nonterminal symbols,
I = {$, f, g} is the set of indices and P is the least set containing the following
production rules:

    S → A$          B  → CC

    A → Bf          Cg → C'         C' → CC

    B → Bg          Cf → D          D  → d

This grammar is on reduced form with a marked index-end. The simple unification
grammar U(G) as given in Definition 9 is then the 5-tuple (N, T, P', L', S) where
P' is the least set containing the following production rules:

    S → A                 B → C      C
        ↓ next = ↑            ↑ = ↓  ↑ = ↓
        ↓ idx = $

    A → B                 C → C'                C' → C      C
        ↓ next = ↑            ↑ next = ↓             ↑ = ↓  ↑ = ↓
        ↓ idx = f             ↑ idx = g

    B → B                 C → D
        ↓ next = ↑            ↑ next = ↓
        ↓ idx = g             ↑ idx = f

and L' contains one single lexicon rule:

    D → d
        ∅

Figure 2 shows the derivation tree for the string "dddd" based on the indexed
grammar G together with the c-structure and the feature structure for the same
string based on the simple unification grammar U(G). This shows that the
string "dddd" is both in L(G) and in L(U(G)). The language generated by G and
U(G) is {d^(2^n) | n ≥ 1}.

[Figure 2, panels: a) Derivation tree, b) C-structure, c) Feature structure]

Figure 2: Derivation tree (a) for the string "dddd" based on the grammar G in
Example 2, together with the c-structure (b) and feature structure (c) for the same
string based on the grammar U(G).



5 A Unification Grammar Formalism for Indexed 
Languages 

We are here going to define a version of the simple unification grammar that
describes the class of indexed languages. Just to be precise, a class of languages C_Γ
over a countable set Γ of symbols is a set of languages, such that each language
L ∈ C_Γ is a subset of Σ* where Σ is a finite subset of Γ. The class C_Γ(GF) of
languages that a grammar formalism GF describes is the set of all languages L'
over Γ such that there exists a grammar G in GF where L(G) = L'. The class of
indexed languages is then the set of languages such that for each language there
exists an indexed grammar that generates the language. We assume that Γ is the set
of all terminal symbols that we use and drop Γ as subscript.

Definition 10 A UNIFICATION GRAMMAR FOR INDEXED LANGUAGES, UQI, is a
simple unification grammar where

a) each equation set in the production rules is of one of the three forms

    E = {↑ = ↓},
    E = {↓ next = ↑, ↓ idx = f},
    E = {↑ next = ↓, ↑ idx = f}

where f is any value symbol, and next and idx are the same two attribute
symbols for all equations in all production rules in UQI,

b) each lexicon rule has an empty equation set.

Lemma 3 The class of languages C(UQI) contains the class of indexed languages. 

Proof: Aho [Aho68] showed that for every indexed language there exists an indexed
grammar on reduced form which generates the language. From Lemma 1 and its
proof we have that for every indexed grammar G on reduced form there exists an
indexed grammar on reduced form with a marked index-end G$, such that L(G) =
L(G$). The simple unification grammar U(G$) defined from the indexed grammar
on reduced form with a marked index-end in Definition 9 is a UQI grammar. From
Lemma 2 we have that L(G$) = L(U(G$)). Then every indexed language can be
generated by a UQI grammar. □

We shall now show that every UQI grammar generates an indexed language,
but to do this we need some technical results. First it is easy to see that every UQI
grammar can be formulated with rules only of the forms used in Definition 9 a)-d).
We define the reduced form for this.

Definition 11 A UQI grammar is on REDUCED FORM if and only if every production
rule is of one of the three following forms:

    A → B                 A → B                 A → B      C
        ↓ next = ↑            ↑ next = ↓            ↑ = ↓  ↑ = ↓        (18)
        ↓ idx = f             ↑ idx = f

Lemma 4 For every UQI grammar there is an equivalent grammar on reduced 
form. 

Proof: Using the techniques from the standard proof for normal form for
context-free grammars, it is straightforward to replace each production rule in the
original grammar not on reduced form with a set of new lexicon rules and production
rules on reduced form. This can be done such that one instance of an original rule
corresponds to the net effect of combining one or more of the new rules. This is
possible since we allow the empty string in lexicon rules. □

To make this formalism more directly comparable to indexed grammars with a
marked index-end we use what we will call a sink-mapped root:

Definition 12 A UQI grammar (N, T, P, L, S) has a SINK-MAPPED ROOT if and
only if it has one and only one production rule where the start symbol occurs and
this rule is of the form

    S → A                                                               (19)
        ↓ next = ↑
        ↓ idx = $

where A ∈ N and the value symbol $ does not occur in any other production rule.

The value symbol $ will form some kind of a blockade in the feature structure 
since it does not occur in any other production rule, hence no other node in the 
c-structure will be mapped to the same node in the feature structure as the root of 
the c-structure. 

What we are doing here is to put a mark at the bottom of the stack of indices, 
in the way the nested stack is represented as a feature structure. We also want 
to map the root of the c-structure to the "sink" of the feature structure when we 
follow the next attribute. 

Lemma 5 For every UQI grammar G there exists a UQI grammar with a sink- 
mapped root G' such that L(G) = L(G'). 






Proof: First we show how, from any UQI grammar G, we may define a UQI grammar
G' with a sink-mapped root. After this we show that for any string w, w ∈ L(G) if
and only if w ∈ L(G').

Let any UQI grammar G = (N, T, P, L, S) be given, and assume that S_0, S' and
S_ε are neither terminal nor nonterminal symbols in G, and that $ is a value symbol
not used in G. The grammar G' is defined by adding the following production and
lexicon rules to the rules we have in G:

i) Let the following be two production rules: 

    S_0 -> S'
           ↓ next = ↑                                                (20)
           ↓ idx = $



    S' -> S      S_ε
          ↑ = ↓  ↑ = ↓                                               (21)



ii) For each f ∈ V used in any production rule in G, let the following be a
production rule:

    S' -> S'
          ↓ next = ↑                                                 (22)
          ↓ idx = f



iii) Let the following be a lexicon rule:

    S_ε -> ε                                                         (23)



Complete G' by adding S_0, S' and S_ε to the nonterminal symbols, and let S_0 be
the start symbol of G'. We see that G' is a UQI grammar with a SINK-MAPPED
ROOT. Notice also that if G is on reduced form, so is the new grammar.³
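
As an illustration only, steps i)-iii) can be sketched in Python. The data layout is
an assumption made for this sketch and is not part of the formalism: a production
rule is a pair (mother, list of (daughter, equations)), a lexicon rule is a pair
(category, string), and the arrows are spelled "up" and "down" in plain strings. The
names add_sink_mapped_root, "S0", "S'" and "S_eps" are hypothetical, and the
equations attached to rule (21) are taken to be "up = down" on both daughters, which
is what the reduced form requires for a binary rule.

```python
def add_sink_mapped_root(nonterminals, productions, lexicon, start, values):
    """Return (N', P', L', S_0) for G', built from G as in steps i)-iii).

    `values` is the set of value symbols used in production rules of G.
    """
    s0, s1, s_eps, sink = "S0", "S'", "S_eps", "$"
    # The new symbols must be fresh, as assumed in the proof.
    assert not {s0, s1, s_eps} & set(nonterminals)
    assert sink not in values

    new_productions = list(productions)
    # (20): S_0 -> S'  with  down next = up,  down idx = $
    new_productions.append(
        (s0, [(s1, ("down next = up", "down idx = " + sink))]))
    # (21): S' -> S S_eps  with  up = down  on both daughters (assumed)
    new_productions.append(
        (s1, [(start, ("up = down",)), (s_eps, ("up = down",))]))
    # (22): one rule S' -> S' for each value symbol f used in G
    for f in sorted(values):
        new_productions.append(
            (s1, [(s1, ("down next = up", "down idx = " + f))]))
    # (23): S_eps -> empty string
    new_lexicon = list(lexicon) + [(s_eps, "")]

    new_nonterminals = set(nonterminals) | {s0, s1, s_eps}
    return new_nonterminals, new_productions, new_lexicon, s0
```

For a grammar with k value symbols the sketch adds k + 2 production rules and one
lexicon rule, and the new start symbol and the value symbol $ each occur in rule
(20) only.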

Now we have to show that for any string w, w ∈ L(G) if and only if w ∈ L(G').

(=>) We show this direction in two steps. First we define what we
call a canonical feature structure for c-structures based on UQI grammars. This is
done such that if the c-structure generates a well defined feature structure at all,
then it also generates the canonical feature structure. After this definition we
show how, from a c-structure based on G together with its canonical feature
structure, we can construct a c-structure together with a feature structure based on
the grammar G'. This is done such that the two c-structures have the same terminal
string, so if the terminal string is in L(G) it is also in L(G').

³ The use of S_ε in rule (21) together with rule (23), where it will label the mother
of a node with the empty string, is only done because we want to stay in the domain
of grammars on reduced form when G is on reduced form. This definition could be
simplified if we did not want this.

Let (D, K, E) be any c-structure based on a UQI grammar G such that it generates
a feature structure. The canonical feature structure (Q, δ, α, m) for the
c-structure is defined as follows: Let first Q+ be the set of all sequences of nodes
from the c-structure with at most 2n + 1 nodes in each sequence, where n is the
height of the c-structure. Then let the name mapping function m be defined, with
values in Q+, by top-down induction on the nodes in the c-structure: First let the
mapping of the root node, m(ε), be the sequence of n + 1 ε's, <ε, ..., ε>, where again
n is the height of the c-structure. Now assume that m(x) is defined for a node x in
the c-structure. Then for each daughter x_i of x, let

    m(x_i) = m(x)              if ↑ = ↓ ∈ E(x_i)
    m(x_i) = pop(m(x))         if ↑ next = ↓ ∈ E(x_i)                (24)
    m(x_i) = add(x_i, m(x))    if ↓ next = ↑ ∈ E(x_i)

where pop of any nonempty sequence is the sequence we get by removing the
first element, pop(<x_1, x_2, ..., x_k>) = <x_2, ..., x_k>, and add of a single element
and a sequence is the sequence we get by adding the single element to the beginning
of the sequence, add(x, <x_1, ..., x_k>) = <x, x_1, ..., x_k>. Since the root node is
mapped to the sequence of n + 1 ε's, pop and add cannot go out of their domain, and
therefore m is well defined.
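
The top-down induction in (24) can be sketched as follows. This is a minimal
illustration under assumptions that are not in the text: nodes are addressed by
digit strings (the root is the empty string, the daughters of x are x + "1" and
x + "2"), the height n is taken to be the length of the longest address, and each
non-root node is tagged with the hypothetical labels "same", "pop" or "push" for
the three kinds of equations in (24).

```python
def name_mapping(kinds):
    """Compute m for a c-structure given as {node address: kind}.

    `kinds` must contain every non-root node, so the mother of each
    listed node is either the root or is itself listed.
    """
    root = ""
    height = max((len(x) for x in kinds), default=0)
    # The root is mapped to the sequence of n + 1 copies of the root.
    m = {root: (root,) * (height + 1)}
    # Shorter addresses first, so every mother is handled before its daughters.
    for x in sorted(kinds, key=len):
        mother = m[x[:-1]]
        if kinds[x] == "same":      # up = down
            m[x] = mother
        elif kinds[x] == "pop":     # up next = down
            m[x] = mother[1:]
        else:                       # down next = up
            m[x] = (x,) + mother
    return m
```

Tuples are used for the sequences so that two nodes with the same name mapping
compare equal and can serve as dictionary keys.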

Extend now the set Q+ such that all the value symbols used in the c-structure also
are elements in Q+. Then let the partial function δ+ : Q+ × {next, idx} → Q+ be
defined such that δ+(q, next) = pop(q) for all nonempty sequences q ∈ Q+, and let
δ+(q, next) be undefined when q is the empty sequence. Moreover let δ+(q, idx) = f
for the value symbol f if and only if there exists a node x in the c-structure with
m(x) = q such that either ↓ idx = f ∈ E(x), or ↑ idx = f ∈ E(x_i) for a daughter x_i
of x. This is the only place where inconsistency may occur, and we will later see that
it does not occur if the c-structure generates any feature structure at all. We extend
the definition of δ+ to pairs of nodes and strings of attribute symbols as described
in the definition of feature structures in the beginning of section 3.
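
The idx part of δ+ and the possible inconsistency can be illustrated with a small
sketch. It assumes, purely for illustration, that nodes are addressed by digit
strings with the root as the empty string, that the name mapping m is given as a
dictionary from node addresses to tuples, and that each idx equation is given as a
hypothetical triple (node, arrow, f), where arrow is "down" for ↓ idx = f in E(node)
and "up" for ↑ idx = f in E(node).

```python
def idx_assignments(m, idx_equations):
    """Collect delta+(q, idx) = f; return None if two equations clash."""
    delta_idx = {}
    for node, arrow, f in idx_equations:
        # "down idx = f" in E(x) constrains m(x);
        # "up idx = f" in E(x_i) constrains m of the mother of x_i.
        target = m[node] if arrow == "down" else m[node[:-1]]
        if delta_idx.setdefault(target, f) != f:
            return None  # two different value symbols for the same q
    return delta_idx
```

A result of None corresponds to the case where the canonical feature structure is
not consistently defined.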

Now, let us shrink the definitions of Q+ and δ+ such that we get a well defined
feature structure. First let Q ⊆ Q+ be the set of all nodes that are reachable from
a named node, formally Q = {q | ∃x ∈ D, ψ ∈ {next, idx}* : δ+(m(x), ψ) = q}.
Then we restrict δ+ to the new domain: δ = δ+ ∩ (Q × {next, idx} × Q). Finally,
let α(f) = f for all value symbols used in the c-structure. We now have a feature
structure, and it is describable and acyclic directly from the definition of Q and δ.
It is also atomic since δ is not defined on any feature symbol node, and α is only
defined on feature symbol nodes. Moreover, it satisfies all the equations from the
c-structure after we have instantiated the up and down arrows. We will now show
that if the c-structure generates any well defined feature structure, it also generates
the well defined canonical one.

Let M' = (Q', δ', α', m') be any well defined feature structure which the
c-structure generates, and assume that we have the canonical feature structure as
described. From the fact that the c-structure generates a feature structure, and from
the definition of the canonical feature structure, we have that if m(x) = m(y) for
any two nodes x and y in the c-structure then m'(x) = m'(y). Now we may define a
function h : Q → Q' from the nodes in the canonical feature structure to the nodes in
M', such that m'(x) = h(m(x)) for all nodes x in the c-structure. Assume then that
we do not have a well defined canonical feature structure because of an inconsistency
in its definition. This means that there exist two instantiated equations, x idx = f
and y idx = f', from the c-structure where m(x) = m(y) but f ≠ f'. However,
then m'(x) = m'(y), so the inconsistency must also occur with respect to M', and the
c-structure cannot generate any well defined feature structure. Hence the canonical
feature structure must be consistently defined, and since it is also describable,
acyclic and atomic it is well defined. Since it also satisfies all the equations in the
c-structure it is generated by the c-structure.

Now we have a well defined canonical feature structure for each c-structure based
on any UQI grammar, provided the c-structure generates a feature structure. Notice
that δ(<ε>, idx) is not defined for the canonical feature structure. This is due to the
mapping of the root in the c-structure to the sequence of n + 1 ε's, where n is the
height of the c-structure. With this height it is only possible to pop off n - 1 ε's
according to the definition of the name mapping (24), and since δ(q, idx) is only
defined for q if there exists a node x such that m(x) = q, δ(<ε>, idx) cannot be defined.

Assume now that w ∈ L(G) for a grammar G. Then we have a c-structure
for w based on G which generates a well defined feature structure. Then it
also generates a canonical feature structure M = (Q, δ, α, m) as described above.
For this feature structure we extend the definition of δ and α as follows: First let
δ(<ε>, idx) = $ and let α($) = $. For all sequences q of ε's such that δ(q, idx)
is not defined, let δ(q, idx) = f for any value symbol f which occurs in the
c-structure. When we construct the new c-structure based on G' the old nodes keep
their mapping values.

We construct a new c-structure for w based on G' by the following steps: First
add a new node on the top of the c-structure by applying the production rule
(21). This also gives us a new sister node for the old root node. Map the two new
nodes to the same node in the extended canonical feature structure as the old root
node. This ensures that the equations in production rule (21) are satisfied by
the extended feature structure. The new sister node labeled with S_ε may only be
the mother of a terminal node labeled with the empty string, so the terminal
string is still w. Now add n nodes above the present root node by applying the
generic production rule (22) n - 1 times and production rule (20) on the topmost
node. This top node will be the root node in the new c-structure and it is now
labeled with the start symbol of G'. When applying the generic production rule
(22), let f = α(δ(m(x1), idx)) for each new node x where it is applied. The new
nodes are each mapped to the sequence of k ε's, where k is the node's distance from
the new root node. In this way the new root node is mapped to the empty sequence,
the daughter of the root node is mapped to <ε>, and so on. Since δ(<ε>, idx) = $
the equations in production rule (20) are satisfied by the feature structure. Moreover,
since f = α(δ(m(x1), idx)) for each node x where the production rule (22) is applied
and δ(q, next) = pop(q), all the equations are satisfied by the feature structure. We
then have a c-structure based on G' with w as terminal string, and this c-structure
generates a well defined feature structure. Then w ∈ L(G').

(<=) Assume that w ∈ L(G') for a grammar G. Then there is a c-structure with
category S_0 in the root, and a sequence of derivations down to a node with category
S, where each intermediate node has category S'. This has been constructed by
first using production rule (20) and then a sequence of zero or more applications
of production rule (22) before production rule (21) gives the node with category S.
Every node above the first node with category S has only one daughter, except its
mother, which has an additional daughter labeled with S_ε. This daughter is the
mother of a single terminal node labeled with the empty string. Then we can remove
all nodes above the node labeled S and still have the same terminal string w in
the c-structure. The new c-structure will have a root node with category S, and
only production rules from the grammar G are used. Since the original c-structure
generates a feature structure, so does the new one. Then w ∈ L(G). □
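
The removal of the nodes above S can be sketched on a toy tree representation.
The representation is an assumption for this illustration only: a tree is a pair
(label, list of subtrees), a terminal leaf has an empty list of subtrees, the empty
string is written "", and the labels "S0", "S'" and "S_eps" are hypothetical
spellings of S_0, S' and S_ε.

```python
def terminal_string(tree):
    """Concatenate the labels of the terminal leaves, left to right."""
    label, children = tree
    if not children:              # a terminal leaf; "" is the empty string
        return label
    return "".join(terminal_string(child) for child in children)


def strip_wrapper(tree):
    """Remove the nodes introduced by rules (20)-(23) and return the S subtree."""
    label, children = tree
    assert label == "S0"          # rule (20) was applied at the root
    node = children[0]
    while len(node[1]) == 1:      # applications of rule (22)
        node = node[1][0]
    s_subtree, s_eps = node[1]    # rule (21): S' -> S S_eps
    assert s_eps[0] == "S_eps"
    return s_subtree
```

Since the only terminal removed is the empty string below S_eps, terminal_string
gives the same result before and after strip_wrapper.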

Now we have the necessary technical results to show that every language in 
C(UQI) is an indexed language. We do this in two steps. 

Lemma 6 For any UQI grammar G on reduced form with a sink-mapped root,
there exists an indexed grammar G_I such that U(G_I) = G.

Proof: Assume that G = (N, T, P, L, S) is a UQI grammar on reduced form with
a sink-mapped root. Then let G_I = (N, T, V, P', S) be an indexed grammar where
V is the set of all value symbols occurring in G, and P' is constructed from P and L
by reversing Definition 9 a)-d). This can be done since G is on reduced form and there
exists a one-to-one relation between the production rules in the indexed grammar and
the production and lexicon rules in the unification grammar defined there. Since G
has a sink-mapped root, the start symbol will occur in one and only one production
rule together with a unique value symbol. Then G_I has a marked index-end and
U(G_I) = G. □

Lemma 7 Every language in C(UQI) is an indexed language. 

Proof: From Lemma 4 and Lemma 5 we have for any language in C(UQI) that there
exists a UQI grammar G on reduced form with a sink-mapped root that generates the
language. From Lemma 6 we have an indexed grammar G_I such that U(G_I) = G.
By Lemma 2 we have that L(G_I) = L(G). Then we have an indexed grammar for
every language in C(UQI). □
From Lemma 3 and Lemma 7 we then have the following result: 

Theorem 1 : The class C(UQI) is the class of indexed languages. 